Lec16 - Mon 3/20: Importing Data

Recall from the First Lecture

Data/Science Pipeline

Drawing

How do I import my own data into R?

  • Not difficult, but it still takes practice.
  • You might need to do this for your final projects.

How do I import my own data into R?

  • Excel .xlsx files are clunky because they carry lots of Microsoft metadata we don’t need. You can use the readxl package to load them.
  • Comma-separated values .csv files are a minimalist spreadsheet format.

What is a CSV file?

A .csv file (example) is just data and no fluff:

  • Rows are separated by line breaks.
  • Values for a given row (i.e. variables) are separated by commas. Each row has the same number of commas.
  • The first row is typically a header row with the column/variable names
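
To see these three rules in action, here is a minimal sketch using base R. The CSV contents (names and values) are made up purely for illustration; count.fields() checks that every row has the same number of fields, and read.csv() turns the text into a data frame.

```r
# A made-up CSV: one header row, then two data rows, values separated by commas
csv_text <- "name,value
Albert,2
Yolanda,3"

# Count the comma-separated fields in each row; they should all match
con <- textConnection(csv_text)
n_fields <- count.fields(con, sep = ",")
close(con)
n_fields

# header = TRUE tells R that the first row holds the variable names
toy <- read.csv(text = csv_text, header = TRUE)
toy
```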

Today’s Exercise 1: Load a CSV into R

Today you will load the DD_vs_SB.csv file, which contains the Dunkin Donuts and Starbucks data. Delaney Moran scraped data from the web for

  • the number of Dunkin Donuts and Starbucks
  • median income

in each of 1024 census tracts in 6 Eastern Massachusetts counties

Drawing

Today’s Exercise 1: Load CSV into RStudio

  1. In the RStudio File Panel -> Navigate to the file -> Click on it and select -> “Import Dataset…”
  2. Make sure “Heading” is set to “Yes”. This tells RStudio that the first row contains the variable names.
  3. Click Import
  4. The View() panel should pop up with the data. Make sure that the variable names are correct.
  5. Plot this data!
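
The same import can also be done in code rather than by clicking. The sketch below uses base R's read.csv(). So that it runs anywhere, it first writes a small stand-in file; the column names tract, dunkin and starbucks are hypothetical and are not claimed to be the real columns of DD_vs_SB.csv. For the actual exercise, set csv_path to the location of DD_vs_SB.csv instead.

```r
# Stand-in file with hypothetical columns (replace with the path to DD_vs_SB.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("tract,dunkin,starbucks", "A,3,1", "B,0,2"), csv_path)

# header = TRUE plays the same role as setting "Heading" to "Yes"
shops <- read.csv(csv_path, header = TRUE)

# Check that the variable names came through correctly
names(shops)
str(shops)
```

In RStudio, View(shops) would then open the same spreadsheet-style viewer as step 4.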

Today’s Exercise 2: Get Started with R Markdown

  • Start Problem Set 06 in R Markdown format
  • Biggest source of confusion: R Markdown has its own environment. Just because something exists in your console doesn’t mean it exists in R Markdown.
  • R Markdown Debugging first

Lec15 - Fri 3/17: 5MV#5 arrange() & _join

Today: Five Main Verbs

  1. filter() rows/observations matching criteria
  2. summarize() numerical variables
  3. group_by() group rows/observations by a categorical variable
  4. mutate() existing variables to create new ones
  5. arrange() rows

And _join!

Arrange

Really simple. Either

  • DATASET_NAME %>% arrange(VARIABLE_NAME) or
  • DATASET_NAME %>% arrange(desc(VARIABLE_NAME))

Arrange Example

library(dplyr)

# Create data frame with three variables
test_data <- tibble(
  name=c("Abbi", "Abbi", "Ilana", "Ilana", "Ilana"),
  value_1=c(0, 1, 0, 1, 0),
  value_2=c(4, 6, 3, 2, 5)
)

# See contents in console
test_data

Arrange Example

Run this code. Notice the subtle difference between 2 and 3:

# 1: Arrange in ascending order
test_data %>% 
  arrange(value_1)

# 2: Arrange in descending order
test_data %>% 
  arrange(desc(value_1))

# 3: Arrange in descending order of value_1, and then within
# value_1, arrange in ascending order of value_2
test_data %>% 
  arrange(desc(value_1), value_2)

Combining Data Sets via Joins

And now the last component of data wrangling: joining/merging two data sets. Run the following:

x <- tibble(x1=c("A","B","C"), x2=c(1,2,3))
y <- tibble(x1=c("A","B","D"), x3=c(TRUE,FALSE,TRUE))
x
y

Combining Data Sets via Joins

We join by the "x1" variable. Note how it is in quotation marks.

left_join(x, y, by = "x1")
full_join(x, y, by = "x1")

Extra on Joins

  • In Chapter 5.3.2, there is an example of joining when variable names are different in the two data sets.
  • There are many types of join (right-hand column of back of cheatsheet). To keep things simple, we’ll try to only use:
    • left_join
    • full_join
  • This illustration succinctly summarizes all of them.
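
As a small sketch of the first point (joining when the variable names differ), the code below reuses the x data frame from the join example and a copy of y whose key column has been given the hypothetical name key. The named vector passed to by says which column on the left matches which column on the right.

```r
library(dplyr)

x <- tibble(x1 = c("A", "B", "C"), x2 = c(1, 2, 3))

# Same contents as y in the join example, but the key column is called "key"
# (a hypothetical name chosen for illustration)
y_renamed <- tibble(key = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE))

# Match x1 in the left table to key in the right table
joined <- left_join(x, y_renamed, by = c("x1" = "key"))
joined
```

Because this is a left join, row "C" is kept with a missing x3, and row "D" from the right table is dropped.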

Lec14 - Thu 3/16: 5MV#3 group_by() & 5MV#4 mutate()

Today: Five Main Verbs

  1. filter() rows/observations matching criteria
  2. summarize() numerical variables
  3. group_by() group rows/observations by a categorical variable
  4. mutate() existing variables to create new ones
  5. arrange() rows

Grouping Example

Run the following in your console:

library(dplyr)

# Create data frame with two variables
test_data <- tibble(
  name=c("Albert", "Albert", "Albert", "Yolanda", "Yolanda"),
  value=c(2, 2, 2, 3, 3)
)

# See contents in console
test_data

Grouping

  • Say we don’t want the overall average, but averages for Albert and Yolanda separately. i.e. grouped by name.
  • group_by(name) attaches grouping meta-data to the data frame
  • meta-data is data about data; it doesn’t change the actual data

Grouping Example

Run the following. Notice the data itself doesn’t change, but the data about the data does:

test_data

test_data %>% 
  group_by(name)

Grouping Example

Run both of these:

test_data %>% 
  summarise(overall_avg = mean(value))

test_data %>% 
  group_by(name) %>% 
  summarise(name_avg = mean(value))

What’s the difference?

Grouping then Summarizing

Chalk talk

5MV#3 Grouping + 5MV#2 Summarize:

Here:

  • Grey, blue, green rows are in the same group
  • For each group, summarize numerical values i.e. many-to-one
Drawing Drawing

5MV#4 Mutate

Mutate existing variables to create new ones. Always of the form:

DATASET_NAME %>% 
  mutate(NEW_VARIABLE_NAME = OLD_VARIABLE_NAMES)

Example

Using the same example as earlier. Run both:

test_data %>% 
  mutate(double_value = value * 2)

test_data %>% 
  mutate(double_value = value * 2) %>% 
  mutate(triple_value = value + double_value)

Lec13 - Piping %>%, 5MV#1 filtering, and 5MV#2 summarize()

Piping

  • R Command: %>%
  • Pronounced: “then”
  • Keyboard shortcuts:
    • macOS: COMMAND+SHIFT+M
    • PC: CTRL+SHIFT+M

Piping

Piping allows you to

  1. Take the output of one function and pipe it as the input of the next
  2. You can string along several pipes to form a single chain
  3. See Chalk Talk
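
A minimal sketch of points 1 and 2, using made-up numbers: the nested call and the piped chain compute the same thing, but the chain reads left to right as "take values, then take square roots, then sum".

```r
library(dplyr)  # loading dplyr makes %>% available

values <- c(1, 4, 9)

# Nested: read from the inside out
nested <- sum(sqrt(values))

# Piped: read from left to right
piped <- values %>% sqrt() %>% sum()

nested
piped
```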

Today: Five Main Verbs

  1. filter() rows/observations matching criteria
  2. summarize() numerical variables
  3. group_by() group rows/observations by a categorical variable
  4. mutate() existing variables to create new ones
  5. arrange() rows

5MV#1 Filter

filter() rows/observations matching criteria

Drawing

Filter Example

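Take flights and then filter for all rows where year is equal to 2014. (The flights data only covers 2013, so this returns zero rows; filtering on year == 2013 keeps every row.)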

Note we use == and not =

library(dplyr)
library(nycflights13)
data(flights)

flights %>% 
  filter(year == 2014)

5MV#2 Summarize

summarize() numerical variables using a many to one function:

Drawing

5MV#2 Summarize

Examples of many to one functions:

  • sum(): sum of n values
  • mean(): mean of n values
  • sd(): standard deviation of n values
  • See backside of cheatsheet -> Summarize Data -> Summary functions
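
Here is a small sketch of these many-to-one functions inside summarize(). The data frame toy and its three values are made up for illustration; each function collapses the three values into a single number.

```r
library(dplyr)

# Made-up data: three values
toy <- tibble(value = c(2, 4, 6))

# Three rows go in, one row of summaries comes out
toy_summary <- toy %>%
  summarize(
    total   = sum(value),
    average = mean(value),
    spread  = sd(value)
  )
toy_summary
```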

Summarize Example

What’s going on here?

library(dplyr)
library(nycflights13)
data(weather)

weather %>% 
  summarize(mean_temp = mean(temp))

Lec12 - Mon 3/13: Intro to Data Wrangling

Switching Gears

With the internet, we are in a new age of data:

Bridging the Gap

  • Jenny Bryan at UBC teaches a graduate level class STAT 545 on Data wrangling, exploration, and analysis with R. Note the ordering.
  • Drawing

Classroom vs Real Data

Jenny Bryan said: “Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth.”

Traditional Classroom Data Real Data
Drawing Drawing

Real Data

Some attributes of real data:

  • Often not in a format ready for analysis
  • Messy and needs cleaning
  • Typos, weird outliers
  • Missing values
  • Inconsistent formatting

Real Data

Inconsistent formatting is a real pain:

  • Dates: “2016/10/12” vs “2016-10-12” vs “10/12/16” vs “10/12/2016” vs “Oct 12, 2016”
  • “DC” vs “D.C.” vs “District of Columbia”
  • “Beyonce” vs “Beyoncé”
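
As a sketch of what cleaning the date example might look like in base R, the code below parses four of the formats above into proper Date values. It assumes that “10/12/16” and “10/12/2016” mean month/day/year, which is exactly the kind of ambiguity that makes inconsistent formatting painful. The text format “Oct 12, 2016” is left out because parsing month names depends on the computer's language settings.

```r
raw_dates <- c("2016/10/12", "2016-10-12", "10/12/16", "10/12/2016")
formats   <- c("%Y/%m/%d",   "%Y-%m-%d",   "%m/%d/%y", "%m/%d/%Y")

# Parse each string with its own format (assumption: 10/12 is month/day)
day_counts <- mapply(
  function(txt, fmt) as.numeric(as.Date(txt, format = fmt)),
  raw_dates, formats
)

# Store the results as a single Date vector
clean_dates <- as.Date(unname(day_counts), origin = "1970-01-01")
clean_dates
```

After cleaning, all four entries represent the same day and print in one consistent format.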

dplyr Package

To tackle this, we now officially introduce the dplyr package: a grammar of data manipulation

Drawing

Pedagogical Note

  • Were it not for this package, I probably wouldn’t be taking a data-centric view to this course.
  • The verb describing the action you want to perform on your data IS the name of the function() you use.
  • So you don’t need extensive programming experience (indexing, for loops, etc) to be able to manipulate data.

5MV

Say hello to the 5MV: the five main verbs

  1. filter() rows/observations matching criteria
  2. summarize() numerical variables
  3. group_by() group rows/observations by a categorical variable
  4. mutate() existing variables to create new ones
  5. arrange() rows

Also, later _join() two separate data frames by corresponding variables

Lec11 - Thu 3/9: 5NG#5 Barplots

Today

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Barplots

Recall from first Grammar of Graphics lecture, we displayed

Exercise

Say these piecharts represent polls for a local election with 5 candidates at time points A, B, and C:

Drawing

Answer the following questions:

  • In the first race, is candidate 5 doing better than candidate 4?
  • Who did better between time A and time B, candidate 2 or candidate 4?

Exercise

Drawing

Barplots

  • y-axis: Both histograms and barplots display notions of relative frequency/counts
  • x-axis:
    • Histogram: continuous variable
    • Barplot: categorical variable
  • geom_bar() is the trickiest of the 5NG, so we’ll use it in limited capacity.

Chief Difficulty with Barplots

Two different ways to have counts show on y-axis:

  • Computed internally by geom_bar()
  • Precomputed manually by yourself in your data in a variable count, n, etc.

Example

Counts are not pre-computed:

Row Number name
1 Albert
2 Albert
3 Albert
4 Mo
5 Mo

Example

Counts are pre-computed in variable n. So n becomes a y aesthetic variable!

name n
Albert 3
Mo 2
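
The two examples can be sketched in ggplot2 as follows. The data frame names not_counted and pre_counted are made up for illustration. In the first plot geom_bar() counts the rows itself, so there is no y aesthetic; in the second, n is mapped to y and stat = "identity" tells geom_bar() to use the values as they are.

```r
library(ggplot2)

# Counts are not pre-computed: one row per observation
not_counted <- data.frame(name = c("Albert", "Albert", "Albert", "Mo", "Mo"))

# Counts are pre-computed in variable n
pre_counted <- data.frame(name = c("Albert", "Mo"), n = c(3, 2))

# geom_bar() computes the counts internally
plot_internal <- ggplot(not_counted, aes(x = name)) +
  geom_bar()

# n is a y aesthetic; stat = "identity" means "do not count, use n as is"
plot_precomputed <- ggplot(pre_counted, aes(x = name, y = n)) +
  geom_bar(stat = "identity")

# Type plot_internal or plot_precomputed in the console to draw each plot
```

Both plots show the same bars: 3 for Albert and 2 for Mo.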

Lec10 - Mon 3/6: Midterm

Administrative

  • In-class Wed 3/8
  • Closed book, no calculators

Philosophy

  • More conceptual in nature
  • Code: you won’t need to write code, but you will need to understand it.
  • Normal curve of distribution of difficulty

Sources

  • Lectures 01 through 09 inclusive
    • Slides from each lecture
    • Corresponding textbook material
    • Learning Checks
    • PS-03

Major Topics

  • Tidy data. What are the components?
  • What is the Grammar of Graphics? How does it tie in with ggplot2?
  • What are the first four of the 5NG? What are their distinguishing features?

Practice Midterm

  • Disclaimer, disclaimer, disclaimer
  • Do not overly interpret the content of this midterm.
  • Rather, view it to get a rough sense of my exam philosophy.

Lec09 - Thu 3/2: 5NG#4 Boxplots

Today

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Example

If I know your name, I can guess your age. Looking at the handout answer the following questions:

As of Jan 1st, 2014 in the United States

  1. What can you say about females named Ella vs Zoe?
  2. What can you say about males named Aidan vs Oliver?
  3. What proportion of male Connors are younger than 16?
  4. What proportion of female Gertrudes are older than 69?

Statistics Terminology

  • The \(p^{th}\) percentile means p% of observations fall below it.
  • Ex: If 30 years old is the 40th percentile of age, then 40% of people are 30 or younger.
  • The horizontal bars indicate the 3 quartiles
    • 1st quartile = 25th percentile
    • 2nd quartile = 50th percentile AKA median. It is a measure of center.
    • 3rd quartile = 75th percentile
  • The distance between the 1st and 3rd quartiles (3rd quartile - 1st quartile) is the interquartile range (IQR)
    • It contains the middle 50% of observations.
    • It is a measure of spread/variability.
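
A quick sketch of these definitions in base R, using made-up ages 1 through 9: quantile() returns the three quartiles and IQR() returns the interquartile range.

```r
# Made-up values for illustration
ages <- 1:9

# The 25th, 50th and 75th percentiles, i.e. the three quartiles
quartiles <- quantile(ages, probs = c(0.25, 0.50, 0.75))
quartiles

# Interquartile range: 3rd quartile minus 1st quartile
iqr <- IQR(ages)
iqr
```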

Boxplots

Chalk Talk: Age of 544 Members of 113th United States Congress:

  • 439 members of House of Representatives
  • 105 Senators

Why Boxplots?

  • Today’s babynames examples are boxplots without the whiskers
  • Boxplots, just like histograms, show distributions. But IMO they are better for comparing multiple distributions with a single line.
  • Ex: Planet Money article. In this case, you can compare cities with a single vertical line.
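
A minimal ggplot2 sketch of comparing two distributions side by side with boxplots. The data frame toy, its groups A and B, and the values are all made up for illustration.

```r
library(ggplot2)

# Made-up data: two groups with five values each
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(1, 2, 3, 4, 5, 3, 4, 5, 6, 7)
)

# One box per group: x is the categorical variable, y is the numerical one
box_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot()

# Type box_plot in the console to draw it
```

A single horizontal line drawn across the plot at any value then cuts through both boxes, which is what makes the comparison easy.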

Lec08 - Wed 3/1: 5NG#3 Histograms + Facets

Today

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Recall

From okcupiddata package, the profiles data set:

Recall

Restricted to heights between 55 (4’7’’) and 80 (6’8’’) inches:

What Histograms Do

  • The y-axis displays notions of relative frequency i.e. which values occur more than others.
  • Huge definition: they are a visualization of the statistical distribution of values.

How Do I Construct Them?

  • We have an x aesthetic
  • Counts on the y-axis are not an explicit variable in the data set, but rather are computed internally. i.e. No y aesthetic
  • The shape of a histogram is dependent on the structure of the bins on the x-axis.

Chalk Talk:

For values: \(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5\)

Let’s draw histograms using the following binning structures:

  1. (-3, -2, -1, 0, 1, 2, 3)
  2. (-4, -2, 0, 2, 4)
  3. (-4, 4)
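
The chalk talk can be checked in base R. As a sketch, hist() with plot = FALSE returns the counts per bin without drawing anything, so you can see how the same six values give differently shaped histograms under each binning structure.

```r
values <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)

# The three binning structures, given as bin boundaries
bins_1 <- c(-3, -2, -1, 0, 1, 2, 3)
bins_2 <- c(-4, -2, 0, 2, 4)
bins_3 <- c(-4, 4)

# Counts in each bin (plot = FALSE skips the drawing)
counts_1 <- hist(values, breaks = bins_1, plot = FALSE)$counts
counts_2 <- hist(values, breaks = bins_2, plot = FALSE)$counts
counts_3 <- hist(values, breaks = bins_3, plot = FALSE)$counts

counts_1  # six bins, one value in each: flat
counts_2  # four bins: 1, 2, 2, 1, so a bump in the middle
counts_3  # one bin holding all six values
```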

Facets

Facets allow you to split ANY plot by a categorical variable. In this case by adding + facet_wrap(~sex) to the ggplot() call
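
A minimal sketch of a faceted histogram. To keep it self-contained, it uses a small made-up stand-in called toy_profiles with height and sex columns rather than the real profiles data set; the heights are invented for illustration.

```r
library(ggplot2)

# Made-up stand-in for the profiles data: height in inches and sex
toy_profiles <- data.frame(
  height = c(62, 64, 65, 66, 68, 68, 70, 71, 72, 74),
  sex    = rep(c("f", "m"), each = 5)
)

# One histogram panel per value of sex
faceted <- ggplot(toy_profiles, aes(x = height)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~sex)

# Type faceted in the console to draw it
```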

Lec07 - Mon 2/27: 5NG#2 Linegraphs

Today

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Recall Example Data

Example

A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.

ggplot(data=simple_ex, aes(x=A, y=B, size=C, color=D )) + 
  geom_line()

Lec06 - Fri 2/24: 5NG#1 Scatterplots

Today

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Today

What’s not great about this plot, especially near (0, 0)?

Overplotting

This is called overplotting: when points are stacked so densely we can’t see what’s going on!

There are two ways of dealing with this:

  1. Make points a little more transparent
  2. Jiggle the points a little
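
Both fixes can be sketched in ggplot2. The data frame stacked is made up so that 50 points sit exactly on (0, 0) and 50 on (1, 1); the alpha value and the jitter amounts are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up data: 50 points stacked at (0, 0) and 50 stacked at (1, 1)
stacked <- data.frame(
  x = rep(c(0, 1), each = 50),
  y = rep(c(0, 1), each = 50)
)

# 1. Transparency: stacked points look darker than single points
plot_transparent <- ggplot(stacked, aes(x = x, y = y)) +
  geom_point(alpha = 0.2)

# 2. Jitter: nudge each point by a small random amount
plot_jittered <- ggplot(stacked, aes(x = x, y = y)) +
  geom_jitter(width = 0.1, height = 0.1)

# Type plot_transparent or plot_jittered in the console to draw each plot
```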

Lec05 - Thu 2/23: More 5NG

Refresher: The Grammar of Graphics

A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.

Refresher: 5NG

The five named graphs we’ll see in this class. Note: I reordered them from last time to be easiest to hardest to work with:

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Data Visualization via ggplot2 Package

  • We are building up to doing data visualization in R via the ggplot2 package
  • Last time we reverse-engineered the grammar from graphical outputs
  • Today we (forward) engineer them

Today’s Data

In tidy format:

A B C D
1 1 3 Hot
2 2 2 Hot
3 3 1 Cold
4 4 2 Cold
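
As a sketch of forward-engineering the grammar, the code below types this table into a data frame and maps each variable to an aesthetic of point geometries. The name simple_ex matches the one used in the line-graph example; the particular mapping chosen here (A to x, B to y, C to size, D to color) is one possible choice.

```r
library(ggplot2)

# Today's data, typed in by hand
simple_ex <- data.frame(
  A = c(1, 2, 3, 4),
  B = c(1, 2, 3, 4),
  C = c(3, 2, 1, 2),
  D = c("Hot", "Hot", "Cold", "Cold")
)

# Map data variables to aes()thetic attributes of geom_etric objects (points)
scatter <- ggplot(data = simple_ex, aes(x = A, y = B, size = C, color = D)) +
  geom_point()

# Type scatter in the console to draw it
```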

Lec04 - Wed 2/22: 5NG

What is a statistical graphic?

  • Today we kick off Topic 2.b) Data Visualization by asking ourselves: What is a statistical graphic?
  • But a brief lesson from military history first

Napoleon’s March on Russia in 1812

In 1812, Napoleon led a French invasion of Russia, marching on Moscow.

Drawing

Napoleon’s March on Russia in 1812

It was one of the biggest military disasters ever, in particular b/c of the Russian winter.

Drawing

Minard’s Illustration of the March

Famous graphical illustration of Napoleon’s march to/from Moscow

Drawing

Minard’s Illustration of the March

This was considered a revolution in statistical graphics because between

  • the map on top
  • the line graph on the bottom

there are 6 dimensions of information (i.e. variables) being displayed on a 2D page.

The Grammar of Graphics

A statistical graphic is a mapping of data variables to aes()thetic attributes of geom_etric objects.

Minard’s Illustration of the March

Where?        data                                 aes()  geom_
top map       longitude                            x      point
top map       latitude                             y      point
top map       army size                            size   path
top map       army direction (forward vs retreat)  color  path
bottom graph  date                                 x      line & text
bottom graph  temperature                          y      line & text

Grammar of Graphics

2005 - Proposal; 2009 - R Implementation

Name this Graph

From ggplot2movies package, the movies data set:

Name this Graph

From nycflights13 package, the flights data set:

Name this Graph

From okcupiddata package, the profiles data set:

Name this Graph

From fueleconomy package, the vehicles data set:

Name this Graph

From babynames package, the babynames data set:

5NG

Say hello to the 5NG: the five named graphs

  1. Scatterplot AKA bivariate plot
  2. Line-graph
  3. Histogram
  4. Boxplot
  5. Barplot AKA Barchart AKA bargraph

Lec03 - Mon 2/20: Tidy Data

What is Tidy Data?

  • There are many ways to organize data. Today we learn one way: the “tidy data” format.
  • It is rather simple, but deceptively powerful.
  • Equivalent to “long format”

What is Tidy Data?

Drawing

  1. Each observation forms a row
  2. Each variable forms a column
  3. Each type of observational unit forms a table

What is Tidy Data (Advanced)?

  1. Each observation forms a row: In other words, each row corresponds to a single observational unit
  2. Each variable forms a column:
    • Some of the variables may be used to identify the observational units. For organizational purposes, it’s generally better to put these in the left-hand columns
    • Some of the variables may be observed values associated with each observational unit
  3. Each type of observational unit forms a table: Don’t mix apples and oranges, keep apples with apples and oranges with oranges

nycflights13 Package

The nycflights13 package contains “tidy data” on all 336,776 flights that departed from NYC (i.e. EWR, JFK and LGA) in 2013.

To help understand what causes delays, it also includes a number of other useful datasets.

  • weather: hourly meteorological data for each airport
  • planes: construction information about each plane
  • airports: airport names and locations
  • airlines: translation between two letter carrier codes and names

Lec02 - Thu 2/16: R Packages

Exercise

In small teams, take 3 minutes to write down

  1. A couple of male and female names that are “modern”
  2. A couple of male and female names that are “old-fashioned”
  3. One male and one female name that are “back in vogue”

Learning R

  • Computers are stupid! You need to:
    • Tell it exactly everything it needs to do
    • Everything needs to be perfect:
      • Write everything from scratch
      • Names of “stuff” need to be typed exactly
      • Parentheses need to match
  • Recall: This is not a class on programming/coding. However, we’ll learn just enough to do statistics and data science
  • Side Benefit: Many of the concepts translate to almost all programming languages: python, javascript, etc.

Learning R

Recall the tradeoff:

Less of this… More of this…
Drawing Drawing

What are R Packages?

  • Base R, i.e. R straight out of the box. It’s fairly limited in power and functionality.
  • R Packages are extensions to R that
    • are contributed by a world-wide community of R users
    • extend base R’s functionality
    • are downloadable over the internet from within RStudio.

Step 1: How Do I Install a Package?

You need to install each package once.

  • In RStudio: Go to Files Panel -> Packages -> Install
  • Type in the package name and click install
  • The procedure for updating a package is the same

Step 2: How Do I Load a Package?

You need to load a package every time you want to use it.

  • Run library(PACKAGENAME) in the console.

Baby’s First R Packages

Today’s Learning Check: Install and then load 3 packages:

  • dplyr: a package for data manipulation
  • ggplot2: a package for data visualization
  • babynames: a package of baby name data
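
Steps 1 and 2 can also be done from the console. The sketch below first checks which of the three packages are already installed, then loads those. The install line is left commented out because installing needs an internet connection and only has to happen once; the variable names pkgs and is_installed are made up for illustration.

```r
pkgs <- c("dplyr", "ggplot2", "babynames")

# Which of the three packages are already installed?
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Step 1 (once per computer): uncomment to install whatever is missing
# install.packages(pkgs[!is_installed])

# Step 2 (every session): load each installed package
for (pkg in pkgs[is_installed]) {
  library(pkg, character.only = TRUE)
}
```

In everyday use you would simply run library(dplyr), library(ggplot2) and library(babynames) one after the other.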

babynames Package

The babynames package contains, for each year from 1880 to 2013, the number of children of each sex given each name in the United States. Only names with more than 5 occurrences are considered.

Lec01 - Mon 2/13: Introduction

Course Title

  • In catalog: Introduction to Statistical Sciences
  • New: Introduction to Statistical and Data Sciences

What is Data Science?

Data Science

  • Example domains: biology, economics, physics, sociology, etc.
  • So why the title switch?

Dialogue with Student

Course Objective #1

Have students engage in the data/science research pipeline in as faithful a manner as possible while maintaining a level suitable for novices.

  • Cobb: Minimizing prerequisites to research
  • Not necessarily publishing in top journals, but answering scientific questions with data.
  • Difficult to do research without understanding stats, however

Data/Science Research Pipeline

We will, as best we can, perform all this:

Data/Science Research Pipeline

And not just this, as in many previous intro stats courses:

Course Objective #2

Foster a conceptual understanding of statistical topics and methods using simulation/resampling and real data whenever possible, rather than mathematical formulae.

  • Whenever we can, use real data
  • Example data set: nycflights13
  • There are two “engines” that can make statistics “work”
    • Mathematics: formulas, approximations, etc
    • Computers: simulations, random number generation

The “Engine” of Statistics

In this course, computers and not math will be the “engine”. What does this mean?

  • Less of this:
    Drawing
  • But more of this:
    Drawing

Programming/Coding

  • Previous programming/coding experience is not a prerequisite to this course
  • This course is not an explicit course on programming, coding, nor computer science. But we will use some elements.
  • Also you will be exposed to basic algorithmic thinking and computational logic
  • Learning R is like learning a foreign language: it’s really hard at first!

Two Simple Rules of Learning Code

  • Computers are stupid!
  • When learning, take existing code that works, and tweak it!

Course Objective #3

Blur the traditional lecture/lab dichotomy of introductory statistics courses by incorporating more computational and algorithmic thinking into the syllabus.

  • Completely separate lecture and labs is a legacy of a time before
    Drawing

RStudio Server

  • Not all laptops are created equal: operating system, processing power, age
  • RStudio Server: cloud-based version of RStudio where all processing is done on Middlebury servers
  • go/rstudio/ (on campus or via VPN)

Course Objective #5

Develop statistical literacy by, among other ways, tying in the curriculum to current events, demonstrating the importance statistics plays in society.

  • H.G. Wells (paraphrased): “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
  • Me: “Sure, it’s easy to lie with statistics. But it’s also hard to tell the truth without them.”

Final Project

  • Capstone experience to align the topics and principles of this course with how research and learning are done in practice.
  • Work on interpersonal and collaborative skills. No textbook on that!

Lecture Format

Either

  • Lab format: With laptop
    • You sit in groups of 4
    • I’ll talk for 10-15 minutes before you work on learning checks
  • Chalk talk: Old-school
    • Keep desk in rows
    • More traditional lecture format

Let’s Build our Toolbox

R, RStudio, and DataCamp

  • R: Software behind the scenes i.e. the engine
  • RStudio: Integrated development environment i.e. the interface
  • DataCamp: Browser-based learning tool i.e. the driver’s ed teacher

Analogy

R RStudio DataCamp
Drawing Drawing Drawing

Test Drive RStudio

  • Login to go/rstudio/ with your Midd account
  • If you don’t have access, raise your hand. (Username: guest1, password: rstudioguest)
  • In RStudio menu bar -> File -> New File -> R Script

The Four Panels

  1. Console: Crunch numbers in R
  2. Files, Packages, Help: See your files, install packages, help files
  3. Editor: Where you’ll write code and save it
  4. Environment: Your workspace

Important: Console

  • This is where you run/execute commands
  • The “>” is the prompt. It means R is ready to receive commands
  • If you don’t see a “>” and want to restart, press ESC.

Switching Gears

Now we will use R via DataCamp instead of via RStudio, but just for driver’s ed. Two panels exist in both:

  1. Editor panel: Where you write code
  2. Console panel: Where you will execute code